Efficient replication of changes to a byte-addressable persistent memory over a network

ABSTRACT

A system and method for efficiently replicating data stored in a byte-addressable, persistent memory of a host computer. A user-level library of the host computer may configure the persistent memory as a software transactional memory (STM) system defined by operations, such as a STM commit operation, that ensure safe and consistent storage of the data within a region of the persistent memory. The library may then cooperate with an application executing on the host computer to control access to the data, e.g., to change the data, as a transaction using the STM commit operation. Within a context of the transaction, the library may precisely determine which bytes of the data have changed within the region, as well as how and when the data bytes have changed. Armed with precise knowledge of the context of the transaction, the library may efficiently replicate the changed data at the byte-addressable granularity.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 14/928,892 entitled Efficient Replication of Changes to a Byte-Addressable Persistent Memory Over a Network, filed on Oct. 30, 2015 by Douglas Joseph Santry, which is a continuation of U.S. patent application Ser. No. 13/901,201, now issued as U.S. Pat. No. 9,201,609 entitled Efficient Replication of Changes to a Byte-Addressable Persistent Memory Over a Network, filed on May 23, 2013 by Douglas Joseph Santry, which applications are hereby incorporated by reference.

BACKGROUND

Technical Field

The present disclosure relates to replication of data and, more specifically, to efficient replication of data in a network environment having a host computer with byte-addressable persistent memory.

Background Information

Many modern computing algorithms are page-based and implemented in a kernel of an operating system executing on a host computer. Paging is a memory management function that facilitates storage and retrieval of data in blocks or “pages” to and from primary storage, such as disks. For example, assume that a page contains 4 kB of data. An application executing on the host computer may utilize a page-based algorithm to, e.g., insert a new node into a doubly-linked list. Execution of the algorithm may result in a first modified (“dirtied”) page, i.e., the page with a previous pointer, a second dirtied page, i.e., the page with a next pointer, and a third dirtied page containing the newly written node. Accordingly, execution of the page-based node insertion algorithm results in three (3) dirty pages or 12 kB of data.

The advent of byte-addressable persistent memory, such as storage class memory, may accelerate adoption of primary storage to reside on a memory bus of the host computer, as well as acceptance of “in-memory” computing. The persistent memory may be configured to enable applications executing on the host computer to safely and consistently modify (change) their data at a byte-addressable granularity to, e.g., survive failures. For instance, execution of the node insertion algorithm at a byte-addressable granularity results in approximately 50 bytes of changed data. Yet, even safe and consistent data stored in the persistent memory may be vulnerable in the event of a disaster because there is only a single copy of the data on the host computer.

Therefore, there is a need to replicate the changed data, e.g., to one or more remote machines connected to the host computer over a network to thereby allow recovery from a disaster. However, in order to replicate, for example, the changed data of the page-based node insertion algorithm to a remote machine, the kernel is forced to copy 12 kB of data over the network. This approach is clearly inefficient for changes to data in byte-addressable persistent memory of a network environment.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the embodiments herein may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identically or functionally similar elements, of which:

FIG. 1 is a block diagram of a network environment;

FIG. 2 is a block diagram of a host computer of the network environment;

FIG. 3 is a block diagram of a splinter;

FIG. 4 is a block diagram of a replication group; and

FIG. 5 is an example simplified procedure for replicating data stored in a byte-addressable, persistent memory of the host computer.

DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

The embodiments described herein provide a system and method for efficiently replicating data in a network environment having a host computer with byte-addressable persistent memory. A user-level library of the host computer may configure the persistent memory as a software transactional memory (STM) system defined by operations, such as a STM commit operation, that ensure safe and consistent storage of the data within a region of the persistent memory. The library may then cooperate with an application executing on the host computer to control access to the data, e.g., to change the data, stored in the region of the persistent memory as a transaction using the STM commit operation. Within a context of the transaction, the library may precisely determine which bytes of the data have changed within the region, as well as how and when the data bytes have changed. Armed with precise knowledge of the context of the transaction, the library may efficiently replicate (i.e., copy) the changed data at the granularity at which it was modified, e.g., at the byte-addressable granularity.

In one or more embodiments, the library may initiate replication of the data by forwarding the changed data to a network adapter of the host computer as one or more splinters associated with the transaction. Illustratively, a splinter may contain information such as a starting or base memory address of the changed data within the region, a length of the changed data and a string of bytes of the changed data. The network adapter may thereafter forward each splinter over the computer network, e.g., within one or more frames, to one of a plurality of remote storage servers having storage devices, such as disks, organized as a replication group for the region. As described herein, the information contained within the splinter of the transaction may be stored on a disk of the replication group using either a synchronous or asynchronous mode of replication.

In one or more embodiments, selection of the disk to store the splinter may be determined in accordance with an equivalence class technique. Illustratively, the equivalence class technique may logically apportion an address space of the region, as defined by a multi-bit memory address, into a number of equivalence classes defined by a predetermined number of high bits of the multi-bit memory address. In addition, each equivalence class may have a storage subspace defined by a predetermined number of low bits of the multi-bit memory address. The storage subspaces of the equivalence classes may then be assigned to the disks of the replication group using modulus arithmetic, e.g., [equivalence class number] mod [number of disks]. The selected disk of the replication group may thereafter be determined by mapping the base memory address of the splinter to the assigned storage subspaces again using modulus arithmetic. Accordingly, the equivalence class technique may be employed to substantially uniformly distribute the splinters of the transaction over the disks of the replication group.

Description

FIG. 1 is a block diagram of a network environment 100 that may be advantageously used with one or more embodiments described herein. The environment 100 may include a host computer 200 coupled to a plurality (e.g., a cluster) of storage servers 110 over a computer network 150. The computer network 150 may include one or more point-to-point links, wireless links, a shared local area network, a wide area network or a virtual private network implemented over a public network, such as the well-known Internet, although, in an embodiment, the computer network 150 is illustratively an Ethernet network. The environment 100 may also include a master server 160 configured to manage the cluster of storage servers 110. The master server 160 may be located anywhere on the network 150, such as on host computer 200 or on a storage server 110; however, in an embodiment, the master server 160 is illustratively located on a separate administrative computer.

Each storage server 110 may be embodied as a computer, such as a storage system, storage appliance such as a filer, or a blade running a user-level process, configured to provide storage services to the host computer 200. As such, each storage server 110 includes computing and memory elements coupled to one or more storage devices, such as disks 120. The host computer 200 may communicate with the storage servers 110 using discrete messages or splinters 300 contained within frames 170, such as Ethernet frames, that are transmitted over the network 150 using a variety of communication protocols including, inter alia, wireless protocols and/or Ethernet protocols. However, in an embodiment described herein, the frame 170 is illustratively encapsulated within a User Datagram Protocol/Internet Protocol (UDP/IP) messaging protocol.

FIG. 2 is a block diagram of host computer 200 that may be advantageously used with one or more embodiments described herein. The host computer 200 illustratively includes a processor 210 connected to a persistent memory 220 over a memory bus 250 and connected to a network adapter 230 over a system bus 240. The network adapter 230 may include the mechanical, electrical and signaling circuitry needed to connect the host computer 200 to the storage servers 110 over computer network 150. The network adapter 230 may also include logic circuitry configured to transmit frames 170 containing the splinters 300 over the network 150 in accordance with one or more operational modes that replicate information contained in the splinters on the disks 120 of the storage servers 110.

The persistent memory 220 may illustratively be embodied as non-volatile memory, such as storage class memory, having characteristics that include, e.g., byte addressability of data organized as logical constructs, such as a file or region 228, in the memory. The byte-addressable, persistent memory 220 may include memory locations that are addressable by the processor 210 for storing software programs and data structures associated with the embodiments described herein. The processor 210 may, in turn, include processing elements and/or logic circuitry configured to execute the software programs, such as user-level library 225, and manipulate the data structures, such as transaction 224. An operating system kernel 226, portions of which are typically resident in persistent memory 220 and executed by the processing elements, functionally organizes the host computer by, inter alia, invoking operations in support of one or more applications 222 executing on the computer. Illustratively, the application 222 may be implemented via a process that includes a plurality of threads. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used to store and execute program instructions pertaining to the embodiments herein.

As used herein, the region 228 may be a logically contiguous address space that is backed physically with the persistent memory 220. The region 228 may be mapped into an address space of the application (i.e., process) to enable modification, e.g., writing, of the region 228 by the application. Once the region is mapped into the application's address space, the user-level library 225 may control access to the region. That is, the application 222 may read and/or write data stored in the region of the locally attached persistent memory through the library 225. As a result, the user-level library 225 may operate as a control point for accessing the persistent memory 220, thereby circumventing the operating system kernel 226.

In an embodiment, the user-level library 225 may configure the persistent memory 220 as a software transactional memory (STM) system defined by operations, such as a STM commit operation, that ensure safe and consistent storage of data in the region 228 of the persistent memory 220, as well as on one or more disks 120 of the storage servers 110. To that end, the user-level library 225 contains computer executable instructions executed by the processor 210 to perform operations that select a storage server on which to replicate the data. In addition, the library 225 contains computer executable instructions executed by the processor 210 to perform operations that modify the persistent memory 220 to provide, e.g., atomicity, consistency, isolation and durability (ACID) semantics or properties. The ACID properties of the STM system are illustratively implemented in the context of transactions, such as transaction 224, which atomically move data structures (and their associated data) stored in the memory from one correct ACID state to another. The STM system thus enables the application 222 to modify its data of a region 228 in a manner such that the data (e.g., data structure) moves atomically from one safe consistent state to another consistent state in the persistent memory 220.

Illustratively, the library 225 may cooperate with application 222 to control access to the data stored in the region of the persistent memory 220 as transaction 224 using the STM commit operation. In an embodiment, the application (i.e., thread) may initiate the transaction 224 by assembling all elements (data) that it intends to write for that transaction; this is referred to as a read/write (r/w) set of the transaction. For example, assume that the transaction 224 involves inserting a new node into a doubly-linked list within region 228. In accordance with the byte addressability property of the persistent memory 220, the application may render small, random modifications or changes to the data; to that end, the elements of the r/w set that the application intends to write (change) may include a previous pointer, a next pointer, and a new node, thereby resulting in approximately 50 bytes of changed data. The application 222 may then cooperate with the user-level library 225 to execute the transaction in accordance with the STM commit operation. Successful execution of the commit operation (and the transaction) results in changing every element (datum) of the write set simultaneously and atomically, thus ensuring that the contents of the persistent memory are safe and consistent. Notably, within the context of the transaction 224, the library 225 may precisely determine which bytes of the data have changed within the region 228, as well as how and when the data bytes have changed. Armed with precise knowledge of the context of the transaction, the library 225 may efficiently replicate (i.e., copy) the changed data at the granularity at which it was modified, e.g., at the byte-addressable granularity.
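As a rough illustration of how a commit operation can know exactly which bytes changed, the following minimal sketch buffers a transaction's writes and applies them together at commit time. The `Region` and `Transaction` classes, the in-memory `bytearray`, and the lock are assumptions made for this example; the sketch does not model persistence ordering, logging, or conflict detection, and is not the library's actual interface.

```python
import threading


class Region:
    """A byte-addressable region modeled as a bytearray (illustrative only)."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.lock = threading.Lock()


class Transaction:
    """Buffers intended writes until commit (hypothetical API)."""

    def __init__(self, region):
        self.region = region
        self.write_set = {}  # offset -> bytes the application intends to write

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > len(self.region.memory):
            raise IndexError("write outside the region")
        self.write_set[offset] = bytes(data)

    def commit(self):
        """Apply every buffered write together and report exactly what changed."""
        with self.region.lock:
            for offset, data in self.write_set.items():
                self.region.memory[offset:offset + len(data)] = data
        changed = sorted(self.write_set.items())
        self.write_set = {}
        return changed


region = Region(64)
txn = Transaction(region)
txn.write(0, b"ab")        # nothing reaches the region until commit
txn.write(10, b"xyz")
changed = txn.commit()     # list of (offset, bytes) pairs, one per changed run
```

Because the write set is assembled before the commit, the list returned by `commit()` is already the precise record of changed bytes that replication needs.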

In one or more embodiments, the library 225 may initiate replication of the data by forwarding the changed data to network adapter 230 of host computer 200 as one or more splinters 300 associated with the transaction 224. FIG. 3 is a block diagram of a splinter 300 that may be advantageously used with one or more embodiments described herein. Illustratively, splinter 300 may contain information such as a starting or base memory address 310 of the changed data within the region, a length 320 of the changed data and a string of bytes 330 of the changed data. Notably, the splinters 300 are created at the granularity of the actual individual bytes of data that are written. For example, referring to the node insertion transaction described above, three (3) splinters containing the changed data are illustratively created by the library and forwarded to the adapter in the context of transaction 224: a first splinter containing the base memory address, length and bytes of the next pointer; a second splinter containing the base memory address, length and bytes of the previous pointer; and a third splinter containing the base memory address, length and bytes of the newly inserted node. Replicating changed data at the byte-addressable granularity represents a substantial cost savings because time and computing resources, such as network bandwidth and network buffer space, are not wasted on replicating (copying) data that has not changed. Assume, for example, that changes or updates to the previous pointer and the next pointer, as well as writing of the new node may result in approximately 50 bytes of data. Therefore, instead of copying 12 kB of data in accordance with a previous page-based node insertion algorithm, the library 225 need only copy 50 bytes of data.
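The splinter layout and the 50-byte versus 12 kB comparison can be sketched as follows. The `Splinter` class, the helper `splinters_for_write_set`, the example addresses, and the individual field sizes (two 8-byte pointers and a 34-byte node) are assumptions chosen so that the total matches the approximately 50 bytes mentioned in the text.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # 4 kB page, as in the background example


@dataclass(frozen=True)
class Splinter:
    """One contiguous run of changed bytes (hypothetical layout)."""
    base_address: int  # starting memory address of the change within the region
    data: bytes        # the changed bytes themselves

    @property
    def length(self) -> int:
        """Length of the changed data, derived from the byte string."""
        return len(self.data)


def splinters_for_write_set(write_set):
    """Build one splinter per (address, bytes) entry of a write set."""
    return [Splinter(addr, bytes(buf)) for addr, buf in sorted(write_set.items())]


# Node insertion: each changed item lives on a different page.
write_set = {
    0x0010_0008: b"\x00" * 8,   # next pointer of the predecessor node
    0x0030_0010: b"\x00" * 8,   # previous pointer of the successor node
    0x0050_0000: b"\x00" * 34,  # the newly inserted node
}
splinters = splinters_for_write_set(write_set)
bytes_replicated = sum(s.length for s in splinters)
pages_dirtied = len({s.base_address // PAGE_SIZE for s in splinters})
page_based_bytes = pages_dirtied * PAGE_SIZE
```

Under these assumed sizes, `bytes_replicated` is 50 while `page_based_bytes` is 12288, which is the contrast the paragraph draws.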

The network adapter 230 may thereafter forward each splinter 300 over computer network 150 to one of the plurality (cluster) of remote storage servers 110 having disks 120 organized as a replication group for the region. In an embodiment, the splinter 300 may be created by the library 225 in the context of the STM commit operation and forwarded over the network 150 in accordance with a synchronous mode of replication. Here, the splinter is loaded (possibly individually) into a frame 170, processed by a network protocol stack of the operating system kernel 226 and promptly transmitted by the network adapter 230 over the network 150 to a storage server 110 serving a selected disk 120 of the region's replication group. According to the synchronous mode of replication, the library 225 may wait for a response from the storage server (e.g., indicating that the splinter was successfully stored on the selected disk) before the STM commit operation for the transaction completes (returns). Therefore, when the commit returns, a successful transaction may be guaranteed to be replicated, meaning that all splinters in the transaction have been replicated (or none of them have been replicated). Illustratively, a 2-phase commit protocol may be employed such that if replication fails, the transaction fails and the failure (error) is propagated to the application (via the library).
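The all-or-nothing behavior of the synchronous mode can be sketched with a simplified two-phase exchange. Everything here is an assumption made for illustration: `FakeDisk` is an in-memory stand-in for a remote disk, and the `prepare`, `commit`, and `abort` method names are hypothetical rather than part of any described protocol.

```python
class ReplicationError(Exception):
    """Raised when a transaction could not be replicated."""


class FakeDisk:
    """In-memory stand-in for a remote disk (hypothetical, for illustration)."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.staged = {}   # transaction id -> splinters awaiting commit
        self.stored = {}   # base address -> replicated bytes

    def prepare(self, txn_id, splinters):
        if not self.healthy:
            return False
        self.staged[txn_id] = list(splinters)
        return True

    def commit(self, txn_id):
        for base_address, data in self.staged.pop(txn_id, []):
            self.stored[base_address] = data

    def abort(self, txn_id):
        self.staged.pop(txn_id, None)


def synchronous_commit(txn_id, splinters_by_disk, disks):
    """Return only after every disk has stored its splinters, or raise."""
    prepared = []
    # Phase 1: every disk must accept its share of the transaction.
    for disk_no, splinters in splinters_by_disk.items():
        if disks[disk_no].prepare(txn_id, splinters):
            prepared.append(disk_no)
        else:
            for p in prepared:
                disks[p].abort(txn_id)
            raise ReplicationError(f"disk {disk_no} rejected transaction {txn_id}")
    # Phase 2: all disks accepted, so make the splinters durable everywhere.
    for disk_no in prepared:
        disks[disk_no].commit(txn_id)
```

In this sketch the exception plays the role of the error that is propagated to the application when replication fails.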

FIG. 4 is a block diagram of a replication group 400 that may be advantageously used with one or more embodiments described herein. The replication group 400 is associated with region 228 and may be organized by, e.g., assignment of a predetermined number of disks 120 attached to a number of remote storage servers 110. The assignment of disks to the replication group is illustratively performed by the master server 160. In one embodiment, the number of storage servers 110 included within the replication group 400 may equal the number of disks assigned to the replication group, such that each storage server 110 serves one disk 120. In other embodiments, the number of storage servers may not equal the number of disks of the replication group, such that one storage server 110 may serve more than one disk 120; illustratively, these latter embodiments are dependent upon the bandwidth available to the server/disks. Notably, the splinters 300 of transaction 224 are subsumed within the region 228 and are distributed substantially uniformly over the disks 120 of the replication group 400.

In one or more embodiments, selection of disk 120 within the replication group 400 to store the splinter 300 may be determined in accordance with an equivalence class technique. Illustratively, the equivalence class technique may logically apportion an address space of the region 228, as defined by a multi-bit memory address, into a number of equivalence classes defined by a predetermined number of high bits of the multi-bit memory address. In addition, each equivalence class may have a storage subspace defined by a predetermined number of low bits of the multi-bit memory address. The equivalence classes may then be mapped to the disks 120 of the replication group 400 using modulus arithmetic, e.g., [memory address] mod [number of equivalence classes], where the number of equivalence classes is greater than or equal to the number of disks. The mapping results in assignment of a plurality of subspaces per disk, illustratively in a round-robin manner, such that each storage server 110 is responsible for a disjoint subset of equivalence classes. The proportion of subspaces that a storage server is assigned may be directly proportional to the number of disks that it contributes to the replication group 400. The union of the subspaces served by the storage servers is therefore a complete image of the region 228.

The selected disk of the replication group 400 may thereafter be determined by mapping the base memory address 310 of the splinter 300 to the assigned storage subspaces of the disks again using modulus arithmetic. Here, the low address bits n are ignored when calculating the modulus. The remaining number m of high address bits is used to map the splinter to the selected disk by taking the modulus of the remaining high address bits with respect to the number of equivalence classes. The mapping results in forwarding of the splinter 300 to the selected disk 120 based on the subspace assigned to the disk. Illustratively, the persistent memory 220 may include a plurality of queues 232 configured to store the splinter 300 prior to forwarding of the splinter 300 to the network adapter 230 as, e.g., frame 170. In an embodiment, the number of queues 232 may equal the number of disks assigned to the replication group 400, such that each queue 0-D is associated with a corresponding disk 0-D of the replication group. Accordingly, the library 225 may illustratively organize the queues 232 according to the disks 120 of the replication group 400.

For example, assume that the replication group is assigned a predetermined number d of disks, wherein d is illustratively 10, such that there are 10 disks per replication group. Assume further that a predetermined number of equivalence classes c is selected such that the number of disks is less than or equal to the number of equivalence classes (i.e., c≥d). Also assume that the multi-bit memory address is illustratively a 32-bit memory address (i.e., a pointer) and that the region has a 32-bit address space (e.g., as defined by the 32-bit memory address pointer). A predetermined number n of low memory address bits, wherein n is illustratively 20, is used to create a sub-address space (“subspace”) having a capacity of 2^n (i.e., 2²⁰) bytes or 1 MB. The remaining number m of high memory address bits, wherein m is illustratively 12, is used to create 2^m (i.e., 2¹²) or 4096 (4 k) subspaces distributed over the c number of equivalence classes. That is, 4096 (2¹²) subspaces each 1 MB (2²⁰ bytes) in size are distributed, e.g., uniformly, over the c number of equivalence classes. According to the technique, the distribution of subspaces across the equivalence classes may be achieved uniformly by using modulus arithmetic such that subspace x is in equivalence class y if and only if x mod c = y, i.e., [m high memory address bits] mod [c number of equivalence classes] = mapped equivalence class; for example, 4095 (a subspace number of the 4096 subspaces numbered 0 to 4095) mod 10 (number of equivalence classes) = 5 (equivalence class number), so that subspace number 4095 maps to equivalence class number 5. The mapping results in an initial assignment of approximately 410 1 MB subspaces per disk (i.e., 6 disks × 410 subspaces + 4 disks × 409 subspaces = 4096 subspaces across 10 disks, in the above example where c=d=10). The selection of the disk (as well as the queue 232) to receive the splinter 300 may be determined in a manner similar to mapping of the subspaces to the equivalence classes, i.e., [equivalence class number] mod [number of disks], where the number of disks is less than or equal to the number of equivalence classes (i.e., c≥d); for example, equivalence class number 5 maps to disk number 5, i.e., 5 (equivalence class number) mod 10 (number of disks) = 5 (disk number).
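The arithmetic of the worked example above can be checked with a short sketch. The function names `subspace_of`, `equivalence_class_of`, and `disk_of` are hypothetical; the constants are the illustrative values from the text (n = 20, m = 12, c = d = 10).

```python
from collections import Counter

LOW_BITS = 20      # n: low address bits, giving 1 MB subspaces
HIGH_BITS = 12     # m: high address bits, giving 4096 subspaces
NUM_CLASSES = 10   # c: number of equivalence classes
NUM_DISKS = 10     # d: number of disks, with c >= d


def subspace_of(address: int) -> int:
    """Drop the n low bits; the m high bits number the subspace."""
    return (address >> LOW_BITS) & ((1 << HIGH_BITS) - 1)


def equivalence_class_of(address: int, num_classes: int = NUM_CLASSES) -> int:
    """Subspace x belongs to class y if and only if x mod c = y."""
    return subspace_of(address) % num_classes


def disk_of(address: int, num_classes: int = NUM_CLASSES,
            num_disks: int = NUM_DISKS) -> int:
    """Map the equivalence class to a disk, again by modulus."""
    return equivalence_class_of(address, num_classes) % num_disks


# Any address inside the last 1 MB subspace (number 4095).
addr = (4095 << LOW_BITS) | 0x123

# Count how many of the 4096 subspaces land on each disk.
per_disk = Counter(disk_of(s << LOW_BITS) for s in range(1 << HIGH_BITS))
```

With these values, `addr` falls in subspace 4095, equivalence class 5, and disk 5, and `per_disk` holds 410 subspaces for six of the disks and 409 for the other four.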

According to the technique described herein, each splinter of a transaction is transmitted to one storage server and stored on one disk of the region's replication group. Yet, the splinters of the transaction may be transmitted to different storage servers attached to the disks of the replication group. In other words, the splinters 300 carrying the changed data (updates) associated with a single transaction, such as transaction 224, may be split up and loaded into different queues 232, and forwarded to different disks 120 of possibly different storage servers 110. For example, refer again to the node insertion transaction described above where three (3) splinters are created by the user-level library 225. A disk of the replication group associated with the region is selected by taking the modulus of the base address of each splinter. Thus, the 3 splinters may be transmitted to 3 different disks because each disk has responsibility for a disjoint subset (subspace) of the region's address space. As a result, each frame 170 may be destined to one storage server (i.e., one disk) and the frame may be loaded with one or more splinters having base addresses within the disk's assigned storage subspace. The equivalence class technique therefore provides uniform distribution of the splinters 300 of the transaction 224 over the disks 120 of the replication group 400.
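How the splinters of a single transaction may end up on different queues can be sketched as follows. The `select_disk` and `route` helpers and the example addresses are assumptions; the constants reuse the illustrative values n = 20 and c = d = 10 from the worked example.

```python
from collections import defaultdict

LOW_BITS, NUM_CLASSES, NUM_DISKS = 20, 10, 10  # values from the worked example


def select_disk(base_address: int) -> int:
    """Pick the disk (and queue) for a splinter from its base address."""
    subspace = base_address >> LOW_BITS  # ignore the n low bits
    return (subspace % NUM_CLASSES) % NUM_DISKS


def route(splinters):
    """Place each (base_address, data) splinter on the queue of its disk."""
    queues = defaultdict(list)
    for base_address, data in splinters:
        queues[select_disk(base_address)].append((base_address, data))
    return dict(queues)


transaction = [
    (0x0010_0008, b"N" * 8),   # next pointer, in subspace 1
    (0x0030_0010, b"P" * 8),   # previous pointer, in subspace 3
    (0x0050_0000, b"X" * 34),  # new node, in subspace 5
]
queues = route(transaction)    # keys are disk numbers 1, 3 and 5
```

Two splinters whose base addresses share a subspace, such as 0x00100000 and 0x001FFFFF, would land on the same queue and could therefore travel in the same frame.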

In an embodiment, the master server 160 may include a memory configured to store computer executable instructions executed by a processor to perform operations needed to manage the cluster of storage servers, including formation and management of the replication group 400. To that end, the master server 160 maintains a storage repository, such as a database 420, of all storage servers and their attached disks within the cluster. Illustratively upon start-up or boot, a storage server 110 may broadcast a message over the network 150 that attempts to locate the master server 160. The master server may respond to the message by providing its location to the storage server. The storage server may then reply with certain characterizing parameters, e.g., an amount of (persistent) memory in the server, a number of disks attached to the server, and available storage capacity of the disks. Over time, the master server thus accumulates database 420 of all the storage servers on the network constituting the cluster.

The master server 160 may also cooperate with the library 225 to replicate data changed in the region 228 of the persistent memory 220. For example, in response to application 222 creating region 228, the library 225 may contact the master server 160, which assembles replication group 400 for the region. Illustratively, the master server may assemble the replication group by assigning disks to the group in a manner that, e.g., matches the bandwidth of each disk 120 with the bandwidth of the network adapter 230 (and network 150) regardless of whether that requires one storage server or multiple storage servers. The master server 160 may then record information (such as the disks and their attached storage servers constituting the replication group) in database 420 and inform the library 225 as to which disks of the storage servers constitute the region's replication group. Thereafter, in response to changes to the data of the region, the library 225 may select a disk of the replication group to replicate the changed data by implementing the equivalence class technique described herein. Notably, each region within persistent memory 220 of host computer 200 has an associated replication group 400.
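One possible reading of the registration and group assembly steps is sketched here. The `MasterServer` class and its methods are hypothetical, and the bandwidth-matching policy (adding disks until their combined bandwidth covers that of the network adapter) is an assumption, since the text does not specify the exact policy. The sketch also does not track disks already assigned to other groups.

```python
class MasterServer:
    """Toy model of the master server's database and group assembly."""

    def __init__(self):
        self.database = {}  # server name -> list of (disk id, bandwidth in MB/s)
        self.groups = {}    # region name -> list of (server name, disk id)

    def register(self, server, disks):
        """Record the parameters a storage server reports at start-up."""
        self.database[server] = list(disks)

    def form_group(self, region, adapter_bandwidth):
        """Assign disks until their combined bandwidth covers the adapter's."""
        group, total = [], 0
        for server in sorted(self.database):
            for disk_id, bandwidth in self.database[server]:
                if total >= adapter_bandwidth:
                    break
                group.append((server, disk_id))
                total += bandwidth
        if total < adapter_bandwidth:
            raise RuntimeError("not enough disk bandwidth in the cluster")
        self.groups[region] = group  # remembered for later queries
        return group


master = MasterServer()
master.register("server-a", [("disk0", 100), ("disk1", 100)])
master.register("server-b", [("disk0", 100)])
group = master.form_group("region-228", adapter_bandwidth=250)
```

Here the group spans two storage servers because one server alone cannot match the assumed adapter bandwidth, mirroring the remark that the assignment may require one storage server or several.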

FIG. 5 is an example simplified procedure for replicating data stored in a byte-addressable, persistent memory of a host computer in a network environment that may be advantageously used with the embodiments described herein. The procedure 500 begins at step 505 and proceeds to step 510 where the library queries the master server for a replication group in response to the application creating the region. At step 515, the master server consults its database and, at step 520, forms a replication group of disks for the region, as described herein. At step 525, the master server forwards a message to the library containing a list of storage servers serving the disks of the replication group. At step 530, the library organizes (arranges) the queues according to the disks of the replication group. At step 535, the application modifies (changes) data of the region in accordance with a transaction and, at step 540, the library cooperates with the network adapter to initiate replication of the changed data by transmitting the changed data over the network as one or more splinters associated with the transaction. At step 545, a disk of the replication group is selected to store the splinter (carried within a frame) in accordance with the equivalence class technique described herein to thereby replicate the changed data. The procedure then ends at step 550.

Advantageously, the remote storage servers 110 may be configured to store off-host, redundant copies of the data on disks 120, which data is primarily stored in persistent memory 220. These off-host, redundant copies of the stored data are illustratively used for disaster recovery deployments. When deployed as such, the use of disks is economically attractive, thereby enabling, e.g., petabytes of secondary, backing storage on the disks of the remote storage servers in support of terabytes of primary storage on persistent memory 220 in the host computer.

While there have been shown and described illustrative embodiments for efficiently replicating data stored in a byte-addressable, persistent memory of a host computer in a network environment, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the embodiments herein. For example, embodiments have been shown and described herein with relation to a synchronous mode of replication. However, the embodiments in their broader sense are not so limited, and may, in fact, allow the information contained within the splinter of the transaction to be stored on a disk of the replication group using an asynchronous mode of replication. Illustratively, the asynchronous mode separates the STM commit operation from replication, i.e., returning from the commit operation has no bearing on whether replication has succeeded. Here, the commit operation merely ensures that the splinter is loaded on an appropriate queue to continually pack a frame, e.g., an Ethernet frame, with other splinters destined to the selected disk of the replication group to optimize for bandwidth. In other words, the asynchronous mode is configured to wait until the frame is filled with the splinters before transmitting the frame over the network to the selected disk, thereby substantially increasing throughput of the system. A completion notification for the replication may be subsequently returned once the storage server responds (e.g., indicating that the splinters in the frame were successfully stored on disk).
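The frame packing behavior of the asynchronous mode can be sketched with a small accumulator, one instance per queue. The `FramePacker` class, the 1500-byte payload budget, and the 12-byte per-splinter header are assumptions made for this example; no wire format is specified in the text.

```python
ETHERNET_PAYLOAD = 1500  # assumed payload budget per frame, in bytes
HEADER_BYTES = 12        # assumed per-splinter header: 8-byte address + 4-byte length


class FramePacker:
    """Accumulates splinters for one disk and emits a frame when it is full."""

    def __init__(self, capacity=ETHERNET_PAYLOAD):
        self.capacity = capacity
        self.pending = []  # splinters waiting to be transmitted
        self.used = 0      # bytes consumed by the pending splinters

    def add(self, base_address, data):
        """Queue a splinter; return a full frame to transmit, or None."""
        size = HEADER_BYTES + len(data)
        if size > self.capacity:
            raise ValueError("splinter does not fit in one frame")
        frame = None
        if self.used + size > self.capacity:
            frame = self.flush()  # the frame is full, so hand it off first
        self.pending.append((base_address, data))
        self.used += size
        return frame

    def flush(self):
        """Return whatever is pending as a frame and start a new one."""
        frame, self.pending, self.used = self.pending, [], 0
        return frame


packer = FramePacker(capacity=100)
first = packer.add(0x100000, b"a" * 28)   # 40 bytes used, nothing to send yet
second = packer.add(0x100040, b"b" * 28)  # 80 bytes used, still waiting
full = packer.add(0x100080, b"c" * 28)    # would overflow, so a frame is emitted
```

In this sketch the commit path only calls `add()`; the frame returned when the budget is exceeded would be transmitted separately, which is what decouples the commit from replication.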

The foregoing description has been directed to specific embodiments. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that storage class memory as described herein may be selected from, among others: SONOS Flash, Nanocrystal Flash, Ferroelectric RAM (FeRAM), Magnetic RAM (MRAM), Phase-Change RAM (PCRAM), Resistive RAM (RRAM), Solid Electrolyte RAM, and Polymer/Organic RAM.

It is equally contemplated that the components and/or elements described herein can be implemented as software encoded on a tangible (non-transitory) computer-readable medium (e.g., disks and/or CDs) having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the embodiments herein. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the embodiments herein.

What is claimed is:
 1. A system comprising: a cluster of storage servers coupled to a network, the storage servers attached to at least one storage device organized as a replication group; and a host computer coupled to the cluster of storage servers over the network, the host computer including a processor connected to a storage class memory (SCM), the processor configured to execute an operating system kernel, a user-level library stored in the SCM, the user-level library to control access to data stored in a region of the SCM and to change the data of the region at a byte-addressable granularity, the user-level library replicating the changed data of the region to the replication group, the user-level library configured to: define a storage subspace; assign the storage subspace to the at least one storage device of the replication group; and select the at least one storage device of the replication group to receive the changed data to enable replication of the changed data at the byte-addressable granularity.
 2. The system of claim 1 wherein a logically contiguous address space of the region of the SCM is mapped into an address space of an application.
 3. The system of claim 1 wherein the user-level library is further configured to configure the SCM as a software transactional memory (STM) system to store the data of the region as a transaction.
 4. The system of claim 3 wherein the user-level library is further configured to perform operations that modify the SCM to provide atomicity, consistency, isolation and durability (ACID) properties of the STM system.
 5. The system of claim 4 wherein the ACID properties are implemented in a context of the transaction to atomically move the data of the region from one consistent state to another consistent state in the SCM.
 6. The system of claim 4 wherein the STM system is defined by a STM commit operation to ensure safe and consistent storage of the data in the region as well as on the at least one storage device of the replication group.
 7. The system of claim 6 wherein the user-level library further cooperates with the application to control access to the data stored in the region of the SCM as the transaction using the STM commit operation.
 8. The system of claim 7 wherein the application is configured to: initiate the transaction by assembling elements of the data as a read/write set of the transaction; and render small, random modifications to the elements of the set at the byte-addressable granularity of the SCM.
 9. The system of claim 8 wherein the application and user-level library are configured to: execute the transaction according to the STM commit operation to change each element of the set simultaneously and atomically to ensure that the data of the SCM are safe and consistent.
 10. The system of claim 9 wherein the user-level library configured within a context of the transaction is further configured to: determine which bytes of the data changed in the region; determine how the bytes of the data changed in the region; and determine when the bytes of the data changed in the region.
 11. The system of claim 10 wherein the user-level library configured within the context of the transaction is further configured to: replicate the changed data at the byte-addressable granularity to the replication group.
 12. A method comprising: organizing at least one storage device attached to a cluster of storage servers as a replication group; controlling access to data stored in a region of a storage class memory (SCM) by a user-level library executing on a processor of a host computer; defining a storage subspace; assigning the storage subspace to the at least one storage device of the replication group; changing the data of the region at a byte-addressable granularity; and selecting the at least one storage device of the replication group to receive the changed data to enable replication of the changed data at the byte-addressable granularity.
 13. The method of claim 12 further comprising: configuring the SCM as a software transactional memory (STM) system to store the data of the region as a transaction.
 14. The method of claim 13 further comprising: performing operations that modify the SCM to provide atomicity, consistency, isolation and durability (ACID) properties of the STM system.
 15. The method of claim 14 further comprising: implementing the ACID properties in a context of the transaction to atomically move the data of the region from one consistent state to another consistent state in the SCM.
 16. The method of claim 14 further comprising: defining the STM system by a STM commit operation to ensure safe and consistent storage of the data in the region as well as on the at least one storage device of the replication group.
 17. The method of claim 16 wherein controlling access to the data further comprises: controlling access to the data stored in the region of the SCM as the transaction using the STM commit operation.
 18. The method of claim 17 further comprising: initiating the transaction by assembling elements of the data as a read/write set of the transaction; and rendering small, random modifications to the elements of the set at the byte-addressable granularity of the SCM.
 19. The method of claim 18 further comprising: executing the transaction according to the STM commit operation to change each element of the set simultaneously and atomically to ensure that the data of the SCM are safe and consistent.
 20. A non-transitory computer readable medium including program instructions for execution on a processor, the program instructions configured to: organize at least one storage device attached to a cluster of storage servers as a replication group; control access to data stored in a region of a storage class memory by a user-level library executing on the processor; define a storage subspace; assign the storage subspace to the at least one storage device of the replication group; change the data of the region at a byte-addressable granularity; and select the at least one storage device of the replication group to receive the changed data to enable replication of the changed data at the byte-addressable granularity.